feat: support appstream data keywords and categories for FTS #204

Open
m1rm wants to merge 18 commits into main from feat/integrate-appstream-data

Conversation

Collaborator

@m1rm m1rm commented Apr 12, 2026

TL;DR

Integrates Arch Linux AppStream metadata into the Go app: adds keywords and categories columns to the package table, fills them from the upstream Components-x86_64.xml.gz on sources.archlinux.org, and exposes them to FTS5 search with tuned BM25 weights. Adds an update-appstream CLI command and just update-appstream, and includes it in just update-data.

Motivation

Improve package search (including German-oriented terms) using AppStream <keywords> and <categories> data.
Keep the implementation streaming and aligned with existing update jobs (pacmandb-style callback parsing).

Data source & versioning

Behaviour

What is indexed

  • Keywords: text from <keywords> only (not name/summary/description).
  • Categories: text from <categories> only (not pacman groups).
  • Language:
    • blocks without xml:lang (neutral),
    • en / de (including BCP47 prefixes like de-DE),
    • and the same rules on individual <keyword> / <category> elements when present.
  • Stopwords: English + German closed-class words are stripped in dedupeWords before storage (dedupe is case-insensitive).
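
A minimal sketch of the two filters described above, assuming straightforward implementations. The function names keywordLangAccepted and dedupeWords come from this PR, but the bodies and the tiny stopword set below are illustrative guesses, not the code in the branch.

```go
package main

import (
	"fmt"
	"strings"
)

// stopwords is a small illustrative subset (assumption); the PR's real
// English + German list is not reproduced here.
var stopwords = map[string]struct{}{
	"the": {}, "and": {}, "for": {},
	"der": {}, "die": {}, "und": {},
}

// keywordLangAccepted reports whether an xml:lang value should be indexed:
// empty (neutral), "en", "de", or a tag whose primary subtag is one of
// those, such as "de-DE".
func keywordLangAccepted(lang string) bool {
	if lang == "" {
		return true
	}
	primary, _, _ := strings.Cut(strings.ToLower(lang), "-")
	return primary == "en" || primary == "de"
}

// dedupeWords drops stopwords and case-insensitive duplicates, keeping
// first-seen order and the casing of the first occurrence.
func dedupeWords(words []string) []string {
	seen := make(map[string]struct{}, len(words))
	out := make([]string, 0, len(words))
	for _, w := range words {
		w = strings.TrimSpace(w)
		key := strings.ToLower(w)
		if key == "" {
			continue
		}
		if _, stop := stopwords[key]; stop {
			continue
		}
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, w)
	}
	return out
}

func main() {
	fmt.Println(keywordLangAccepted("de-DE"), keywordLangAccepted("fr")) // true false
	fmt.Println(dedupeWords([]string{"Browser", "browser", "und", "Web"})) // [Browser Web]
}
```

Comparing only the primary subtag keeps a value like "den" from being mistaken for German, which a plain prefix check on "de" would accept.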

Database & search

  • Migrations add keywords and categories on package, extend package_fts with matching columns, rebuild after changes.
  • update-packages upserts do not overwrite keywords / categories (same pattern as popularity).
  • Search uses BM25 with higher weight on name/description than on keywords / categories to limit dilution from AppStream text.
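
For reviewers less familiar with FTS5: bm25() takes one weight per indexed column, in declaration order, and returns lower values for better matches, so results sort ascending. The sketch below builds such a query. The column list, the weights (10/5/1/1), and the helper searchQuery are assumptions for illustration and are not the values used in this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ftsColumn pairs an FTS5 column with its BM25 weight (hypothetical type).
type ftsColumn struct {
	Name   string
	Weight float64
}

// searchQuery builds a ranked FTS5 query. The weights must be listed in the
// same order as the columns were declared in the virtual table, otherwise
// they apply to the wrong columns.
func searchQuery(table string, cols []ftsColumn) string {
	weights := make([]string, len(cols))
	for i, c := range cols {
		weights[i] = strconv.FormatFloat(c.Weight, 'f', 1, 64)
	}
	return fmt.Sprintf(
		"SELECT rowid FROM %s WHERE %s MATCH ? ORDER BY bm25(%s, %s) LIMIT ?",
		table, table, table, strings.Join(weights, ", "),
	)
}

func main() {
	// Illustrative column order and weights: name and description count
	// more than the AppStream-derived columns.
	cols := []ftsColumn{
		{"name", 10}, {"description", 5}, {"keywords", 1}, {"categories", 1},
	}
	fmt.Println(searchQuery("package_fts", cols))
}
```

Because the weights are positional, a test that matches on each new column after the rebuild is a cheap guard against drift between the schema and the query.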

Operations

  • go run . update-appstream (requires DATABASE; optional APPSTREAM_SOURCES_BASE).
  • just update-appstream; just update-data runs update-appstream after update-packages.
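
The environment handling could look roughly like the sketch below: DATABASE is mandatory and APPSTREAM_SOURCES_BASE falls back to a default. The config type, the loadConfig helper, and the placeholder default URL are hypothetical; the real default base is whatever the command defines.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// config holds the settings update-appstream needs (hypothetical type).
type config struct {
	Database    string
	SourcesBase string
}

// loadConfig reads DATABASE (required) and APPSTREAM_SOURCES_BASE (optional).
// getenv is injected so tests can supply values without touching the process
// environment; defaultBase is used when the override is unset.
func loadConfig(getenv func(string) string, defaultBase string) (config, error) {
	db := getenv("DATABASE")
	if db == "" {
		return config{}, errors.New("DATABASE must be set")
	}
	base := getenv("APPSTREAM_SOURCES_BASE")
	if base == "" {
		base = defaultBase
	}
	return config{Database: db, SourcesBase: base}, nil
}

func main() {
	// Placeholder default, not the real upstream base URL.
	cfg, err := loadConfig(os.Getenv, "https://example.invalid/appstream")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("database=%s base=%s\n", cfg.Database, cfg.SourcesBase)
}
```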

Testing

  • Unit tests for XML parsing (keywords, categories, xml:lang on blocks and elements), keywordLangAccepted, stopwords, dedupeWords.
  • Tests updated for new FTS column list; just test / just lint pass.

Follow-ups (optional)

  • If we choose to keep both keywords and categories, we could merge the migrations into a single migration.
  • Surface categories in the package detail UI if desired.
  • Revisit BM25 weights after production metrics.

@m1rm m1rm self-assigned this Apr 12, 2026
@m1rm m1rm marked this pull request as draft April 12, 2026 13:25
@m1rm m1rm changed the title feat: initial implementation for appstream data fetcher in Golang feat: support appstream data keywords and categories for FTS Apr 12, 2026
@m1rm m1rm marked this pull request as ready for review April 12, 2026 14:26
@m1rm m1rm force-pushed the feat/integrate-appstream-data branch from 01c46f3 to 9157312 Compare April 12, 2026 14:36
m1rm and others added 16 commits April 13, 2026 11:49
…a (summary, description etc.) was too messy and thus messed up search rankings
…est into main exists and pushes are added to branch with open PR

piggyback: improve ci setup; account for duplicate runs

tryout to fix duplicate ci runs the other way
flush already hands ownership off to the caller (p.cur is nilled), so
append-clone of keywords/categories was pure overhead per <component>.
Strict=false hid real malformed-input errors; the upstream feed is
well-formed XML.
Two parallel maps keyed by pkgname plus a third union map to iterate
was wasted memory and an extra pass. One map of {kw, cat} slices covers
it.
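
A rough illustration of the single-map accumulator this commit describes, assuming a small struct per package. The names terms, accumulator, and add are hypothetical and may not match the branch.

```go
package main

import "fmt"

// terms holds the keywords and categories collected for one package.
type terms struct {
	kw, cat []string
}

// accumulator maps pkgname to its collected terms, replacing two parallel
// maps plus a union map with one structure and one iteration.
type accumulator map[string]*terms

// add appends keywords and categories for pkg, creating the entry on first use.
func (a accumulator) add(pkg string, kw, cat []string) {
	t, ok := a[pkg]
	if !ok {
		t = &terms{}
		a[pkg] = t
	}
	t.kw = append(t.kw, kw...)
	t.cat = append(t.cat, cat...)
}

func main() {
	acc := accumulator{}
	acc.add("firefox", []string{"browser"}, nil)
	acc.add("firefox", nil, []string{"Network"})
	acc.add("gimp", []string{"image"}, []string{"Graphics"})
	for pkg, t := range acc {
		fmt.Println(pkg, t.kw, t.cat)
	}
}
```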
The http.Client parameter on Update was always passed nil from main; the
fallback client timeouts (15m / 2m) were pure dead code since the ctx
deadline (10m in runCommand) always wins. Match the convention of the
other update commands: no client param, no hand-rolled timeouts. Also
unexport latestRelease — it's not called from outside the package.
*gzip.Reader already satisfies io.Reader.
Replace the manual stack + 6 skip flags with the decoder's own Skip():
when a <keywords>/<keyword>/<categories>/<category> tag has a
non-en/de xml:lang, skip its entire subtree in one call. The remaining
state is five booleans tracking the enter/leave of accepted elements.

No behavior change; all existing tests pass.
@pierres pierres force-pushed the feat/integrate-appstream-data branch from 9157312 to 3c5646d Compare April 13, 2026 10:25
pierres added 2 commits April 13, 2026 12:28
Upstream archlinux-appstream-data publishes core/extra/multilib only,
so core-testing/extra-testing/multilib-testing packages never get
keywords or categories columns populated. Document the asymmetry so
search-ranking surprises don't lead down a wrong debugging path.
Extract the DB-facing portion of Update into applyTerms so tests can
drive it directly against in-memory SQLite without mocking the HTTP
fetches. Covers:

- Keywords + categories land on matching package rows; non-mentioned
  packages stay empty.
- FTS matches on both the new keyword and category columns after the
  rebuild — catches column-order drift between the schema and the query.
- A second run clears stale data from rows no longer in the accumulator.
- Duplicates and stopwords are stripped by dedupeWords.
- All-stopword input does not update the row at all.